Shader Branching
-
Branchless math (mix/step/clamp-style selects) is often faster than divergent control flow: when threads in the same warp/wavefront take different branches, both paths are executed with masking. See the sketch at the end of this section.
-
.
-
.
-
.
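-
A minimal sketch of the idea in C++ with glm (the equivalent mix/step/clamp intrinsics exist in GLSL/HLSL); the function names and the select below are illustrative assumptions, not code from any source. Note that the branchless form evaluates both inputs, so it only wins when both sides are cheap; shader compilers often perform this transformation themselves.
#include <glm/glm.hpp>
// Branchy version: threads of a warp/wavefront that disagree on the condition execute both sides anyway.
glm::vec3 shade_branchy(float x, float edge, glm::vec3 a, glm::vec3 b)
{
    if (x > edge)
        return a;
    return b;
}
// Branchless version: evaluate both inputs and select with step/mix.
// step(edge, x) is 0.0 when x < edge and 1.0 otherwise, so mix() picks b or a.
glm::vec3 shade_branchless(float x, float edge, glm::vec3 a, glm::vec3 b)
{
    float s = glm::step(edge, x); // 0.0 or 1.0
    return glm::mix(b, a, s);     // b when s == 0, a when s == 1
}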
AZDO (Approach Zero Driver Overhead)
Motivation
-
GPUs have orders of magnitude higher throughput than CPUs on data-parallel algorithms, and rendering is almost entirely data-parallel.
-
With the GPU deciding its own work, latencies are minimized as there is no roundtrip from CPU to GPU and back.
-
Frees up the CPU from a lot of work which can now be used on other things.
-
Repeated small vkCmdBindVertexBuffers / vkCmdBindIndexBuffer and vkCmdDrawIndexed calls force the driver to record many small draw commands (and possibly do relocations or check memory), which is expensive.
-
Many small state changes cause the driver to update internal tables, validate, or patch commands — that’s CPU work and cannot be avoided without batching.
-
Different drivers / GPUs behave differently. Some drivers do more CPU work per bind/draw and will be worse in this pattern.
-
Draw commands can be sourced from a GPU buffer (indirect drawing). This has two significant advantages:
-
Draw calls can be generated from the GPU (such as in a "compute shader"), and
-
An array of draw calls can be called at once, reducing command buffer overhead
-
Optimizations
-
Batch vkCmdDrawIndexed: Multi-draw Indirect.
-
vkCmdDrawIndexedIndirect / multi-draw indirect lets the GPU consume a small indirect buffer with many draws in one driver call.
-
Avoid binding vertex/index buffer per draw — bind the quad vertex/index buffer once and supply per-instance data via instance attributes or an SSBO.
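-
A hedged sketch of the batching idea with Vulkan (struct and function names other than the Vulkan API are assumptions): fill an array of VkDrawIndexedIndirectCommand once, and replace many vkCmdDrawIndexed calls with a single vkCmdDrawIndexedIndirect. drawCount > 1 requires the multiDrawIndirect device feature, and uploading the commands into indirectBuffer is assumed to happen elsewhere.
#include <vulkan/vulkan.h>
#include <cstdint>
#include <vector>
// Hypothetical per-mesh range description.
struct MeshRange { uint32_t indexCount; uint32_t firstIndex; int32_t vertexOffset; };
// Build one indirect command per object instead of recording one vkCmdDrawIndexed per object.
std::vector<VkDrawIndexedIndirectCommand> BuildIndirectCommands(const std::vector<MeshRange>& meshes)
{
    std::vector<VkDrawIndexedIndirectCommand> draws(meshes.size());
    for (size_t i = 0; i < meshes.size(); ++i) {
        draws[i].indexCount    = meshes[i].indexCount;
        draws[i].instanceCount = 1;                 // a GPU culling pass can rewrite this later
        draws[i].firstIndex    = meshes[i].firstIndex;
        draws[i].vertexOffset  = meshes[i].vertexOffset;
        draws[i].firstInstance = (uint32_t)i;       // lets the shader index per-object data in an SSBO
    }
    return draws;
}
void RecordBatchedDraws(VkCommandBuffer cmd, VkBuffer indirectBuffer, uint32_t drawCount)
{
    // Vertex/index buffers, pipeline and descriptor sets were bound once before this call.
    vkCmdDrawIndexedIndirect(cmd, indirectBuffer, 0, drawCount, (uint32_t)sizeof(VkDrawIndexedIndirectCommand));
}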
-
-
Batch vkCmdPushConstants.
-
(2025-12-01) From 5 push calls taking 7.65us, to now 1 push call taking 3.08us.
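-
A sketch of what that batching can look like (the struct and its fields are assumptions, not the original code): instead of several vkCmdPushConstants calls for individual values, pack them into one block and push the whole range once. The packed block must stay within the device's push-constant limit (128 bytes guaranteed) and match the range declared in the pipeline layout.
#include <vulkan/vulkan.h>
#include <glm/glm.hpp>
// Hypothetical packed block; previously each member was pushed with its own call.
struct PushBlock {
    glm::mat4 viewProj;      // 64 bytes
    glm::vec4 cameraPos;     // 16
    glm::vec4 sunDirection;  // 16
    glm::vec2 screenSize;    //  8
    float     timeSeconds;   //  4
    float     pad0;          //  4 -> 112 bytes total
};
void PushAllConstants(VkCommandBuffer cmd, VkPipelineLayout layout, const PushBlock& block)
{
    vkCmdPushConstants(cmd, layout,
                       VK_SHADER_STAGE_VERTEX_BIT | VK_SHADER_STAGE_FRAGMENT_BIT,
                       0, (uint32_t)sizeof(PushBlock), &block);
}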
-
-
Batch vkCmdBindDescriptorSets: using Bindless.
-
Bindless Textures.
-
Bindless / descriptor indexing (VK_EXT_descriptor_indexing / descriptor arrays with update-after-bind) so you can bind a single descriptor set containing all textures and index into it in the shader using the per-instance texture index. This removes the per-draw vkCmdBindDescriptorSets.
-
Binding a descriptor set per draw is heavy if each set references different image samplers; the driver must ensure the GPU has correct descriptors ready (or patch them).
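-
A hedged sketch of creating that single "all textures" binding with descriptor indexing (core in Vulkan 1.2; the binding index and array size are arbitrary assumptions). The shader side then declares an unsized sampler array, e.g. layout(set = 0, binding = 0) uniform sampler2D textures[], and indexes it with the per-instance/material texture index (nonuniformEXT where needed).
#include <vulkan/vulkan.h>
// One large, partially-bound, update-after-bind array of combined image samplers.
VkDescriptorSetLayout CreateBindlessTextureLayout(VkDevice device, uint32_t maxTextures /* e.g. 4096 */)
{
    VkDescriptorSetLayoutBinding binding{};
    binding.binding         = 0;
    binding.descriptorType  = VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER;
    binding.descriptorCount = maxTextures;                 // one slot per loaded texture
    binding.stageFlags      = VK_SHADER_STAGE_FRAGMENT_BIT;
    VkDescriptorBindingFlags flags =
        VK_DESCRIPTOR_BINDING_PARTIALLY_BOUND_BIT |
        VK_DESCRIPTOR_BINDING_UPDATE_AFTER_BIND_BIT |
        VK_DESCRIPTOR_BINDING_VARIABLE_DESCRIPTOR_COUNT_BIT;
    VkDescriptorSetLayoutBindingFlagsCreateInfo flagsInfo{};
    flagsInfo.sType         = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_BINDING_FLAGS_CREATE_INFO;
    flagsInfo.bindingCount  = 1;
    flagsInfo.pBindingFlags = &flags;
    VkDescriptorSetLayoutCreateInfo layoutInfo{};
    layoutInfo.sType        = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO;
    layoutInfo.pNext        = &flagsInfo;
    layoutInfo.flags        = VK_DESCRIPTOR_SET_LAYOUT_CREATE_UPDATE_AFTER_BIND_POOL_BIT;
    layoutInfo.bindingCount = 1;
    layoutInfo.pBindings    = &binding;
    VkDescriptorSetLayout layout = VK_NULL_HANDLE;
    vkCreateDescriptorSetLayout(device, &layoutInfo, nullptr, &layout);
    return layout;
}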
-
-
Batch textures: Texture Atlas.
-
Pack multiple textures into atlases or a texture array and index into them.
-
-
Pre-record commands.
-
Useful if CPU-bound.
-
-
GPU Culling:
-
.
-
For performance reasons, all the culling passes would be combined into a single pass.
-
Etc
-
Once we have a renderer where everything is stored in big GPU buffers, and we don’t use PushConstants or descriptor sets per object, we are ready to go with a GPU-driven-renderer.
-
Because it takes its parameters from a buffer, it is possible to use compute shaders to write into these buffers and do culling or LOD selection in compute shaders. Doing culling this way is one of the simplest and most performant ways of doing culling. Due to the power of the GPU you can easily expect to cull more than a million objects in less than half a millisecond. Normal scenes don’t tend to go as far. In more advanced pipelines like the one in Dragon Age or Rainbow Six, they go one step further and also cull individual triangles from the meshes. They do that by writing an output Index Buffer with the surviving triangles and using indirect to draw that.
-
Store the matrices for all loaded objects into a big SSBO. In GPU driven pipelines, we also want to store more data, such as material ID and cull bounds.
-
GPU driven pipelines work best when the number of binds is as limited as possible. The best-case scenario is to do an extremely minimal number of BindVertexBuffer, BindIndexBuffer, BindPipeline, and BindDescriptorSet calls.
-
The fewer draw calls you use to render your scene, the better, as modern GPUs are really big and have a long ramp-up/ramp-down time. Big modern GPUs love being given massive amounts of work per draw call, as that way they can ramp up to 100% usage.
-
The new Unreal 5 engine relies heavily on compute shaders for software rasterization.
-
The first thing is to go all in on object data in GPU buffers. Per-object PushConstants are removed, per-object dynamic uniform buffers are removed, and everything is replaced by ObjectBuffer where we store the object matrix and we index into it from the shader.
-
A Batch is a set of objects that matches material and mesh. Each batch will be rendered with one DrawIndirect call that does instanced drawing. Each mesh pass (forward pass, shadow pass, others) contains an array of batches which it will use for rendering.
-
When starting the frame, we sync the objects that are on each mesh pass into a buffer. This buffer will be an array of ObjectID + BatchID. The BatchID maps directly as an index into the batch array of the mesh-pass.
-
Once we have that buffer uploaded and synced, we execute a compute shader that performs the culling.
-
For every object in said array of ObjectID + BatchID pairs, we access the object data in the ObjectBuffer using the ObjectID, and check if it is visible. If it’s visible, we use the BatchID index to insert the draw into the Batches array, which contains the draw indirect calls, increasing the instance count. We also write it into the indirection buffer that maps from the instance ID of each batch into the ObjectID.
-
With that done, on the CPU side we iterate over the batches in a mesh pass and execute each of them in order, making sure to bind each batch's pipeline and material descriptor set. The GPU will then use the draw parameters it just wrote during the culling pass to render the objects.
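-
A hedged sketch of that CPU-side loop (the Batch struct and its members are assumptions loosely following the description above): one indirect draw per batch, whose instanceCount was filled in by the culling compute shader.
#include <vulkan/vulkan.h>
#include <vector>
struct Batch {
    VkPipeline       pipeline;
    VkPipelineLayout layout;
    VkDescriptorSet  materialSet;
    uint32_t         firstDraw;   // index of this batch's command in the indirect buffer
};
void DrawMeshPass(VkCommandBuffer cmd, VkBuffer indirectBuffer, const std::vector<Batch>& batches)
{
    for (const Batch& b : batches) {
        // Bind only what changes between batches; vertex/index buffers stay bound for the whole pass.
        vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_GRAPHICS, b.pipeline);
        vkCmdBindDescriptorSets(cmd, VK_PIPELINE_BIND_POINT_GRAPHICS, b.layout,
                                0, 1, &b.materialSet, 0, nullptr);
        // The GPU reads the draw parameters the culling pass just wrote.
        VkDeviceSize offset = b.firstDraw * sizeof(VkDrawIndexedIndirectCommand);
        vkCmdDrawIndexedIndirect(cmd, indirectBuffer, offset, 1, (uint32_t)sizeof(VkDrawIndexedIndirectCommand));
    }
}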
-
Buffer Storage.
-
Direct State Access.
-
Shader Buffer Load.
-
UBO & SSBO.
-
and more...
Avoid being Vertex Bound
-
.
-
TBDR: Tile Based Deferred Renderer GPU.
-
TBIR: Tile Based Immediate Renderer GPU.
-
-
-
When a GPU renders triangle meshes, various stages of the GPU pipeline have to process vertex and index data. The efficiency of these stages depends on the data you feed to them; this library provides algorithms to help optimize meshes for these stages, as well as algorithms to reduce the mesh complexity and storage overhead.
-
The library provides a C and C++ interface for all algorithms; you can use it from C/C++ or from other languages via FFI (such as P/Invoke).
-
gltfpack .
-
gltfpack is a tool that can automatically optimize glTF files to reduce the download size and improve loading and rendering speed.
-
-
Instancing
-
Billboard Grass and GPU Instancing .
-
"Rendering millions of grass".
-
The video is cool.
-
We'll use a compute shader.
-
We take the thread id of our compute shader thread. For a 300×300 square space, we can do position = id.xy - 150 so it's centered over the origin.
-
As our grass is made of 3 meshes (3 billboard quads), this will result in 3 separate instancing calls.
-
To increase the density of the grass in the square space, we can do position *= (1 / density); I'll use density = 2 (see the sketch below).
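-
A small sketch of that position math (plain C++ with glm standing in for the compute shader, since the video shows no source; the 300×300 extent and density = 2 are the values mentioned above).
#include <glm/glm.hpp>
// Ground position of a grass instance from its (simulated) compute-thread id in [0, 300) x [0, 300).
glm::vec2 GrassPosition(glm::uvec2 id, float density = 2.0f)
{
    glm::vec2 position = glm::vec2(id) - 150.0f;  // center the 300x300 grid on the origin
    position *= (1.0f / density);                 // higher density packs the blades closer together
    return position;
}
-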
For this example, I'll render 2,160,000 triangles at 523 fps.
-
Screen: 1289x621
-
Setup? GTX 1660, apparently.
-
-
To get randomness, I did pos.xz += noise() for the position and position.y += noise() to get a different height (higher grass will be grouped with higher grasses).
-
This uses a simplex noise.
-
-
To animate, I'll just skew the top 2 vertices of the mesh in the vertex shader.
-
{9:20} Explanation of what was done to randomize the intensity and frequency of grass sway movement.
-
Hash the instance id to get a hash id. With the hash_id, we check against a threshold to know if we perform a fast cosine or a slow cosine.
-
The grass height changed the cosine frequency.
-
Etc.
-
-
To get the grass displaced along with the terrain displacement, we convert the world-space coordinates of the grass to UV coordinates, so that it can sample the same height map as the terrain mesh.
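-
A sketch of that coordinate conversion, assuming the terrain is a square of size terrainSize centered on the origin (names are mine, not from the video).
#include <glm/glm.hpp>
// Maps a grass blade's world-space XZ position into the terrain height map's [0, 1] UV range,
// so grass and terrain sample the exact same displacement.
glm::vec2 WorldToHeightmapUV(glm::vec2 worldXZ, float terrainSize /* e.g. 300.0f */)
{
    return worldXZ / terrainSize + 0.5f;  // [-terrainSize/2, terrainSize/2] -> [0, 1]
}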
-
.
-
-
I also introduced a color variance in the grass, by making the tip of the grass more yellow, to show aging of the grass. This affects higher grasses a bit more, as they are older.
-
Continuation: Grass Mesh instead of Billboarding and GPU Culling with the Scan and Compact technique .
-
The video is the continuation of the previous one, with two changes: a noise texture for wind simulation, and an optimization using frustum culling with Scan and Compact.
-
The video is cool, though it has less content than the previous one.
-
It shows no implementation or formulas.
-
-
Noise is used for wind movement; it's more sophisticated here, using a noise texture with oscillations, etc., which differs from the randomization used in the billboarding solution.
-
Then, optimization is discussed. This new technique is much heavier than the previous, so he uses Frustum Culling to optimize what is actually instanced.
-
This frustum culling apparently is harder to do, because it requires the Compute Buffer array to be contiguous.
-
He uses: Scan and Compact.
-
The array is scanned, surviving elements are marked, and a new, contiguous array is created containing only the desired elements.
-
For this, a Prefix Sum Scan is used to define which entries form the new array.
-
This is not explained (a minimal sketch of the idea follows below).
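-
Since the video doesn't explain it, here is a minimal CPU sketch of scan-and-compact; on the GPU the exclusive prefix sum and the scatter run in parallel (e.g. a Blelloch scan in a compute shader), but the logic is the same.
#include <cstdint>
#include <vector>
// Compacts 'instances' down to the entries whose flag is 1, producing a contiguous output array.
// flags[i] is 1 if instance i survived frustum culling, 0 otherwise.
std::vector<uint32_t> ScanAndCompact(const std::vector<uint32_t>& instances,
                                     const std::vector<uint32_t>& flags)
{
    // 1) Exclusive prefix sum: scan[i] = number of survivors before i,
    //    which is exactly the output slot of instance i if it survives.
    std::vector<uint32_t> scan(flags.size(), 0);
    for (size_t i = 1; i < flags.size(); ++i)
        scan[i] = scan[i - 1] + flags[i - 1];
    uint32_t survivorCount = flags.empty() ? 0 : scan.back() + flags.back();
    // 2) Scatter: every survivor writes itself to its slot, leaving no gaps.
    std::vector<uint32_t> compacted(survivorCount);
    for (size_t i = 0; i < instances.size(); ++i)
        if (flags[i] != 0)
            compacted[scan[i]] = instances[i];
    return compacted;
}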
-
-
-
-
Continuation: LOD with Chunking .
-
The video is quite short, only talking about the optimization below. It shows no implementation or formulas.
-
It uses a mesh with lower poly count when the object's distance to the camera is greater than a threshold.
-
The justification for using chunking is to help with LOD.
-
Apparently LOD would be done in chunks (?) which would reduce the need for having a second position buffer, specific to the low-poly LOD mesh.
-
-
The final performance is 110fps, with 408MB of VRAM.
-
-
Scene Management
-
Google "GPU Scene Management" for some ideas - BVH, scene graph, ECS on the GPU, etc.
-
It has a concept of an "object" that users can place in the world.
-
These objects can contain multiple meshes and have a bounding box.
-
There is hierarchy (refer to flecs queries on how to do it efficiently).
-
Streaming is handled.
Culling
-
.
-
.
Culling
-
Hierarchical Z-Buffer Culling - 2010 .
-
Rendering with Conviction - GDC 2010 presented this technique, which was first introduced at SIGGRAPH 2008.
-
Hierarchical Z-Buffer Culling - Generating Occlusion Volumes - 2011 .
-
Frustum Culling
-
Tessellation Shader with LOD, Frustum Culling in the Geometry Shader .
-
The video is pretty cool, but shows absolutely no code or formulas. It's only a theoretical discussion of the techniques.
-
Creating geometry in the tessellation shader is better than passing the full mesh to the GPU.
-
Data communication between CPU and GPU will always be the bottleneck.
-
This claim is not that precise, as the performance can actually end up worse.
-
The real performance gain comes from tessellating based on the distance of the object, by using LOD; if it's far away, tessellate less.
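-
A sketch of the distance-based factor (the thresholds and the linear falloff are assumptions; the video gives no formula): compute a tessellation level per patch from its distance to the camera, clamped to the hardware limit (commonly 64).
#include <algorithm>
#include <glm/glm.hpp>
// maxTess when closer than nearDist, falling off linearly to minTess at farDist and beyond.
float TessLevelForDistance(glm::vec3 patchCenter, glm::vec3 cameraPos,
                           float nearDist = 10.0f, float farDist = 200.0f,
                           float minTess = 1.0f, float maxTess = 64.0f)
{
    float dist = glm::distance(patchCenter, cameraPos);
    float t = std::clamp((dist - nearDist) / (farDist - nearDist), 0.0f, 1.0f);
    return glm::mix(maxTess, minTess, t);  // far away -> tessellate less
}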
-
-
Frustum culling is based on the bounding box of the mesh, but since the mesh is so big it is always partially in view, we end up rendering every triangle of the mesh, even the triangles we don't see.
-
The geometry shader can finalize the geometry, as well as cull the triangles we don't need.
-
This technique is usually used for terrain, as it's displaced by a height map.
-
For characters, you would use different models with different poly count, rendering the correct one based on the distance to the camera.
-
-
CPU Frustum Culling :
-
The way this works is that we transform each of the 8 corners of the mesh-space bounding box into screen space, using the object matrix and view-projection matrix. From those, we find the screen-space box bounds, and we check if that box is inside the clip-space view. This way of calculating bounds is on the slow side compared to other formulas, and can have false-positives where it thinks objects are visible when they aren't. All the functions have different tradeoffs, and this one was selected for code simplicity and parallels with the functions we are doing on the vertex shaders.
-
We check for visibility before drawing.
bool is_visible(const RenderObject& obj, const glm::mat4& viewproj)
{
    std::array<glm::vec3, 8> corners {
        glm::vec3 { 1, 1, 1 },
        glm::vec3 { 1, 1, -1 },
        glm::vec3 { 1, -1, 1 },
        glm::vec3 { 1, -1, -1 },
        glm::vec3 { -1, 1, 1 },
        glm::vec3 { -1, 1, -1 },
        glm::vec3 { -1, -1, 1 },
        glm::vec3 { -1, -1, -1 },
    };
    glm::mat4 matrix = viewproj * obj.transform;
    glm::vec3 min = { 1.5, 1.5, 1.5 };
    glm::vec3 max = { -1.5, -1.5, -1.5 };
    for (int c = 0; c < 8; c++) {
        // project each corner into clip space
        glm::vec4 v = matrix * glm::vec4(obj.bounds.origin + (corners[c] * obj.bounds.extents), 1.f);
        // perspective correction
        v.x = v.x / v.w;
        v.y = v.y / v.w;
        v.z = v.z / v.w;
        min = glm::min(glm::vec3 { v.x, v.y, v.z }, min);
        max = glm::max(glm::vec3 { v.x, v.y, v.z }, max);
    }
    // check the clip space box is within the view
    if (min.z > 1.f || max.z < 0.f || min.x > 1.f || max.x < -1.f || min.y > 1.f || max.y < -1.f) {
        return false;
    }
    else {
        return true;
    }
}
-
-
VkGuide:
-
The instanceBuffer is AllocatedBuffer instanceBuffer; from the last article. It stores ObjectID + BatchID (draw indirect ID).
bool IsVisible(uint objectIndex)
{
    //grab sphere cull data from the object buffer
    vec4 sphereBounds = objectBuffer.objects[objectIndex].spherebounds;
    vec3 center = sphereBounds.xyz;
    center = (cullData.view * vec4(center, 1.f)).xyz;
    float radius = sphereBounds.w;
    bool visible = true;
    //frustum culling
    visible = visible && center.z * cullData.frustum[1] - abs(center.x) * cullData.frustum[0] > -radius;
    visible = visible && center.z * cullData.frustum[3] - abs(center.y) * cullData.frustum[2] > -radius;
    if (cullData.distCull != 0)
    {
        // the near/far plane culling uses camera space Z directly
        visible = visible && center.z + radius > cullData.znear && center.z - radius < cullData.zfar;
    }
    visible = visible || cullData.cullingEnabled == 0;
    return visible;
}
void main()
{
    uint gID = gl_GlobalInvocationID.x;
    if (gID < cullData.drawCount)
    {
        //grab object ID from the buffer
        uint objectID = instanceBuffer.Instances[gID].objectID;
        //check if object is visible
        bool visible = IsVisible(objectID);
        if (visible)
        {
            //get the index of the draw to insert into
            uint batchIndex = instanceBuffer.Instances[gID].batchID;
            //atomic-add to +1 on the number of instances of that draw command
            uint countIndex = atomicAdd(drawBuffer.Draws[batchIndex].instanceCount, 1);
            //write the object ID into the instance buffer that maps from gl_InstanceID into ObjectID
            uint instanceIndex = drawBuffer.Draws[batchIndex].firstInstance + countIndex;
            finalInstanceBuffer.IDs[instanceIndex] = objectID;
        }
    }
}
-
Cluster
-
.
-
Each cluster is up to 64 vertices / 124 triangles.
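-
Those limits match what meshoptimizer's meshlet builder targets in its examples; a hedged sketch of splitting an indexed mesh into clusters with it (following the meshoptimizer README; the Vertex layout is an assumption).
#include <vector>
#include <meshoptimizer.h>
struct Vertex { float px, py, pz; /* other attributes */ };  // position first, matching the stride below
// Splits an indexed mesh into clusters of at most 64 vertices / 124 triangles each.
std::vector<meshopt_Meshlet> BuildClusters(const std::vector<unsigned int>& indices,
                                           const std::vector<Vertex>& vertices)
{
    const size_t max_vertices  = 64;
    const size_t max_triangles = 124;
    const float  cone_weight   = 0.0f;  // > 0 also optimizes clusters for cone (backface) culling
    size_t max_meshlets = meshopt_buildMeshletsBound(indices.size(), max_vertices, max_triangles);
    std::vector<meshopt_Meshlet> meshlets(max_meshlets);
    std::vector<unsigned int>    meshlet_vertices(max_meshlets * max_vertices);
    std::vector<unsigned char>   meshlet_triangles(max_meshlets * max_triangles * 3);
    size_t count = meshopt_buildMeshlets(meshlets.data(), meshlet_vertices.data(), meshlet_triangles.data(),
                                         indices.data(), indices.size(),
                                         &vertices[0].px, vertices.size(), sizeof(Vertex),
                                         max_vertices, max_triangles, cone_weight);
    meshlets.resize(count);
    // Real code also keeps meshlet_vertices / meshlet_triangles, which the meshlets index into.
    return meshlets;
}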
-
-
.
-
.
-
.
-
.
-
.
-
-
VK_HUAWEI_cluster_culling_shader.
-
MeshOptimizer
-
From what I saw in the code below, it seems to have several interesting optimizations.
-
ogldev\DemoLITION\Framework\Source\core_model.cpp:492
-
It's the last thing done when creating the mesh.
template<typename VertexType>
void CoreModel::OptimizeMesh(int MeshIndex, std::vector<uint>& Indices, std::vector<VertexType>& Vertices, std::vector<VertexType>& AllVertices)
{
size_t NumIndices = Indices.size();
size_t NumVertices = Vertices.size();
// Create a remap table
std::vector<unsigned int> remap(NumIndices);
size_t OptVertexCount = meshopt_generateVertexRemap(remap.data(),        // dst addr
                                                    Indices.data(),      // src indices
                                                    NumIndices,          // ...and size
                                                    Vertices.data(),     // src vertices
                                                    NumVertices,         // ...and size
                                                    sizeof(VertexType)); // stride
// Allocate a local index/vertex arrays
std::vector<uint> OptIndices;
std::vector<VertexType> OptVertices;
OptIndices.resize(NumIndices);
OptVertices.resize(OptVertexCount);
// Optimization #1: remove duplicate vertices
meshopt_remapIndexBuffer(OptIndices.data(), Indices.data(), NumIndices, remap.data());
meshopt_remapVertexBuffer(OptVertices.data(), Vertices.data(), NumVertices, sizeof(VertexType), remap.data());
// Optimization #2: improve the locality of the vertices
meshopt_optimizeVertexCache(OptIndices.data(), OptIndices.data(), NumIndices, OptVertexCount);
// Optimization #3: reduce pixel overdraw
meshopt_optimizeOverdraw(OptIndices.data(), OptIndices.data(), NumIndices, &(OptVertices[0].Position.x), OptVertexCount, sizeof(VertexType), 1.05f);
// Optimization #4: optimize access to the vertex buffer
meshopt_optimizeVertexFetch(OptVertices.data(), OptIndices.data(), NumIndices, OptVertices.data(), OptVertexCount, sizeof(VertexType));
// Optimization #5: create a simplified version of the model
float Threshold = 1.0f;
size_t TargetIndexCount = (size_t)(NumIndices * Threshold);
float TargetError = 0.0f;
std::vector<unsigned int> SimplifiedIndices(OptIndices.size());
size_t OptIndexCount = meshopt_simplify(SimplifiedIndices.data(), OptIndices.data(), NumIndices,
&OptVertices[0].Position.x, OptVertexCount, sizeof(VertexType), TargetIndexCount, TargetError);
static int num_indices = 0;
num_indices += (int)NumIndices;
static int opt_indices = 0;
opt_indices += (int)OptIndexCount;
printf("Num indices %d\n", num_indices);
//printf("Target num indices %d\n", TargetIndexCount);
printf("Optimized number of indices %d\n", opt_indices);
SimplifiedIndices.resize(OptIndexCount);
// Concatenate the local arrays into the class attributes arrays
m_Indices.insert(m_Indices.end(), SimplifiedIndices.begin(), SimplifiedIndices.end());
AllVertices.insert(AllVertices.end(), OptVertices.begin(), OptVertices.end());
m_Meshes[MeshIndex].NumIndices = (uint)OptIndexCount;
}
Draco
-
Draco .
-
Draco is a library for compressing and decompressing 3D geometric meshes and point clouds. It is intended to improve the storage and transmission of 3D graphics.
-
By Google.
LOD, MipMap
-
Mipmap, Minification Filters, Magnification Filters .
-
MIP: from the Latin multum in parvo, 'much in little'.
-
It downsizes the image in powers of two: each mip level halves the resolution of the previous one, down to 1×1 (level-count sketch below).
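-
The number of levels in a full chain follows directly from that repeated halving; a sketch of the standard computation (the usual formula, e.g. for sizing a complete Vulkan mip chain).
#include <algorithm>
#include <cmath>
#include <cstdint>
// Full mip chain length: keep halving the larger dimension until it reaches 1.
// Example: a 1024x512 texture has floor(log2(1024)) + 1 = 11 levels (1024, 512, ..., 2, 1).
uint32_t FullMipLevelCount(uint32_t width, uint32_t height)
{
    return (uint32_t)std::floor(std::log2((float)std::max(width, height))) + 1;
}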
-
-
{5:31 -> 10:10}
-
Explanation of the Filters
-
Minification:
-
Nearest, Linear, Linear_mipmap_nearest and Linear_mipmap_linear.
-
-
Magnification:
-
Nearest, Linear.
-
-
-
{10:14 -> end}
-
GLSL implementation.
-
-
{19:29 -> 21:02}
-
Differences in the Filters.
-
-
-
-
Trilinear filtering: linearly interpolate between mipmap levels.
-
-
Improving the mipmap for transparent objects at a distance (foliage) - The Witness .
-
The idea is very simple: modify the content of the mipmap during its creation.
-
It shows a formula to manipulate the final result to get a better look.
-
That's all.
-
I found it interesting.
-
-
Using the full chain :
-
Advantages :
-
Hardware trilinear/anisotropic filtering benefits from having all levels available → better quality when minifying.
-
Simplifies generation: many GPU/CPU mip-generation algorithms assume a full chain.
-
No runtime fallback behavior to a coarser final level.
-
-
Disadvantages :
-
Increased memory and upload cost (sum of sizes of all mip levels).
-
Extra work to generate or upload every level (unless you generate on GPU).
-
-
-
Fewer mip levels (1 < mipLevels < max) :
-
Advantages :
-
Lower memory and upload cost.
-
Useful for streaming: allocate only top K levels now, stream lower-res later.
-
Useful for textures that will rarely be minified (UI element, near camera).
-
-
Disadvantages :
-
Potentially poorer filtering when the sampler requests a lower LOD; sampling will effectively use the last available level (coarser detail).
-
If you plan to GPU-blit/generate mips, you must still have declared the target number of levels ahead of generation.
-
Some runtime tools/algorithms may assume a full chain and need adaptation.
-
-
Mesh: LOD
-
-
Godot provides a way to automatically generate less detailed meshes for LOD usage on import, then use those LOD meshes when needed automatically. This is completely transparent to the user. The meshoptimizer library is used for LOD mesh generation behind the scenes.
-
Image Formats
KTX2
-
KTX2 is a container file format for storing texture data, optimized for GPU usage. It’s designed to work efficiently with modern graphics APIs like Vulkan, OpenGL, and DirectX.
Dynamic Resolution
-
.
-
.
-
.
-
.
Tiling-based / VRS / Nanite
Variable Rate Shading (VRS)
-
With VRS you can specify different sampling rates for different parts of the screen. This can be used to optimize performance for either adapting the shading rate to the content, or for adapting the shading rate for things like foveated rendering in VR, where you only need full shading rate at the center of the viewport.
-
Demo .
-
VRS seems to work better with Forward than Deferred Rendering.
Tiling Post-Processing
Nanite
-
Baz - Nanite is still a deferred renderer, but I don't think Forward or Deferred Renderer is the right choice.
-
-
.
Software Based Rasterization
-
High-Performance Software Rasterization on GPUs by Laine and Karras.
-
That paper describes an all-compute rendering pipeline for the traditional 3D triangle workload. The architecture calls for sorting in the middle of the pipeline, so that in the early stage of the pipeline, triangles can be processed in arbitrary order to maximally exploit parallelism, but the output render still correctly applies the triangles in order.
-
In 3D rendering, you can almost get away with unsorted rendering, relying on Z-buffering to decide a winning fragment, but that would result in “Z-fighting” artifacts and also cause problems for semitransparent fragments.
-
Goals :
-
Our endeavor has multiple goals. First, we want to establish a firm data point of the performance of a state-of-the-art GPU software rasterizer compared to the hardware pipeline. We maintain that only a careful experiment will reveal the performance difference, as without an actual implementation there are too many unknown costs. Second, constructing a purely software-based graphics pipeline opens the opportunity to augment it with various extensions that are impossible or infeasible to fit in the hardware pipeline (without hardware modifications, that is). For example, programmable ROP calculations, trivial non-linear rasterization (e.g., [Gascuel et al. 2008]), fragment merging [Fatahalian et al. 2010], stochastic rasterization [Akenine-Möller et al. 2007] with decoupled sampling [Ragan-Kelley et al. 2011], etc., could be implemented as part of the programmable pipeline. Thirdly, by identifying the hot spots in our software pipeline, we hope to illuminate future hardware that would be better suited for fully programmable graphics. The complexity and versatility of the hardware graphics pipeline does not come without costs in design and testing. In an ideal situation, just a few hardware features targeted at accelerating software-based graphics would be enough to obtain decent performance, and the remaining gap would be closed by faster time-to-market and reduced design costs.
-
-
-
2D :
-
Sort-middle architecture - Raph Levien .
-
Not so easy to understand.
-
-
Fast 2D rendering - Raph Levien .
-
Not so easy to understand.
-
-